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Abstract 



We propose an algorithm for solving nonlinear convex programs defined in terms of a sym- 
r) , metric positive semidefinite matrix variable X. This algorithm rests on the factorization 

^J . X = YY , where the number of columns of Y fixes the rank of X. It is thus very effective for 

,J^ . solving programs that have a low rank solution. The factorization X = YY^ evokes a refor- 

C^ . mulation of the original problem as an optimization on a particular quotient manifold. The 

present paper discusses the geometry of that manifold and derives a second order optimization 
method. It furthermore provides some conditions on the rank of the factorization to ensure 
^ ' equivalence with the original problem. The efficiency of the proposed algorithm is illustrated 

^T^ ■ on two applications: the maximal cut of a graph and the sparse principal component analysis 

'sj" ■ problem. 

o 

oo . 

O ■ 1 Introduction 

l^ ' Many combinatorial optimization problems can be relaxed into a convex program. These 

^ ' relaxations are mainly introduced as a tool to obtain lower and upper bounds on the problem 

of interest. The relaxed solutions provide approximate solutions to the original program. 
Even when the relaxation is convex, computing its solution might be a demanding task in 
the case of large-scale problems. In fact, most convex relaxations of combinatorial problems 
consist in expanding the dimension of the search space by optimizing over a symmetric pos- 
itive semidefinite matrix variable of the size of the original problem. Fortunately, in many 
cases, the relaxation is tight once its solution is rank one, and it is expected that the convex 
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relaxation, defined in terms of a matrix variable that is likely to be very large, presents a 
low-rank solution. This property can be exploited to make a direct solution of the convex 
problem feasible in large-scale problems. 

The present paper focuses on the convex optimization problem, 

min fix) 

s.t. Tv{AiX) = bi, Ai£S"',bi£R, i = l,...,m, (1) 

XhO, 

where the function / is convex and S" = {X G M"'^"'!^^ = X} denotes the set of the 
symmetric matrices of R""^". In general, the solution of this convex program has to be 
searched in a space of dimension —^ — - . An approach is proposed for solving ([T|) that is able 
to deal with a large dimension n once the following assumptions hold. 

Assumption 1 The program ([7P presents a low-rank solution X* , i.e., 

rank(X*) = r '^n. 
Assumption 2 The symmetric matrices Ai satisfy 

AiAj = (J, 
for any i,j € {1, . . . , m} such that i ^ j. 
Assumption 2 is fulfilled, e.g., by the spectahedron, 

s = {x e s"|a: h 0, Tr(A:) = i}, 

and the elliptope^l 

£ = {X e S"|X t 0, diag(A:) = 1}. (2) 

Assumption 1 suggests to factorize the optimization variable X as 

X = YY^ (3) 

with Y £ M"^P. This leads to a nonlinear optimization program in terms of the matrix Y, 
min f(YY'^) 



S.t. Tr(y'^yliy) = hi, ^j € §", 6i € M, i = 1, . . . , m. 

Program ^ searches a space of dimension np, which is much lower than the dimension of 
the symmetric positive semidefinite matrices X. However, this program is no longer convex. 

A further potential difficulty of the program ^ is that the solutions are not isolated. For 
any solution Y and any orthogonal matrix Q of M^^^, i.e., Q^Q = I, the matrix YQ is also a 



^The elliptope is also known as the set of correlation matrices. 



solution. In other words, the program (j3|) is invariant by right multiphcation of the unknown 
with an orthogonal matrix. This issue is not harmful for simple gradient schemes but it 
greatly affects the convergence of second order methods (see e.g., jAMSOSj and [AIDV08] ) . 
In order to take into account the inherent symmetry of the solution, the algorithm developed 
in this paper does not optimize on the Euclidean space M"^^. Instead, one considers a search 
space, whose points are the equivalence classes {l^QIQ £ M^^^,Q Q = ^}- The minimizers 
of dU are isolated in that quotient space. 

It is important to mention that the rank r of the solution X* is usually unknown. The 
algorithm we propose for solving ([1]) thus provides a method that finds a local minimizer y* 
of ^ with an approach that increments p until a sufficient condition is satisfied for Y^, to 
provide the solution Y^YJ' of (1). The proposed algorithm converges monotonically towards 
the solution of dTJ, is based on superlinear second order methods, and is provided with an 
indicator of convergence able to control the accuracy of the results. 

The idea of reformulating a convex program into a nonconvex one by factorization of the 
matrix unknown is not new and was investigated in |BM03j for solving semidefinite programs 
(SDP). While the setup considered in |BM03j is general but restricted to gradient meth- 
ods, the present paper further exploits the particular structure of the equality constraints 
(Assumption 2) and proposes second-order methods that lead to a descent algorithm with 
guaranteed superlinear convergence. The authors of |GP07j also exploit the factorization ^ 
to efficiently solve optimization problems that are defined on the elliptope ([2|). Whereas the 
algorithms in |GP07j evolve on the Cholesky manifold — a submanifold of M"^^ whose inter- 
section with almost all equivalence classes is a singleton — , the methods proposed here work 
conceptually on the entire quotient space and numerically in M^^^, using the machinery of 
Riemannian submersions. 

The paper is organized as follows. Section [3] derives conditions for an optimizer of ([!]) to 
represent a solution of the original problem ([T]) . A meta-algorithm for solving ([T]) based on the 
factorization ^ is built upon these theoretical results. Section[3]describes the geometry of the 
underlying quotient manifold and proposes an algorithm for solving ^ based on second order 
derivative information. Sections [5] and [6] illustrate the new approach on two applications: the 
maximal cut of a graph and the sparse principal component analysis problem. 

2 Notations 

Given a function / : S" — > M : X h^ /(^); we define the function 

/ : ]^nxp _^^.Y^ f{Y) = f{YY^). 

The operator V- stands for the first order derivative, i.e., the matrix B = Vx/(^o) represents 
the gradient of / with respect to the variable X evaluated at the point Xq. f is assumed to 



be differentiable and B is defined element wise by 

dl 

Finally, 



Bi,j — -^^ — (-'^o)- 



«./(X„)[Zl = li,n/<^° + «^)-^(^'»' 



t^O t 

denotes the derivative with respect to X of the function / at the point Xq in the direction 
Z. It holds that 

Dxf{Xo)[Z] = {Vxf{Xo),Z), 

where (•, •) denotes the Frobenius inner product {Zi, Zq) = Tt{ZJ Z2). 

3 Optimality conditions 

This section derives and analyzes the optimality conditions of both programs ([1]) and (j3]). 
These provide theoretical insight about the rank p at which (j3|) should be solved as well 
as conditions for an optimizer of ([3]) to represent a solution of the original problem ((!]). A 
meta-algorithm for solving ([1]) is then derived from these results. 

3.1 First-order optimality conditions 

Lemma 1 A symmetric m,atrix X ^E^ solves ([ip if and only if there exist a vector a G M™" 
and a symmetric matrix 5" E S" such that the following holds, 

Tr(AiX) = bi, 

X^O, 

StO, (5) 

SX = 0, 

S = Vxf{X)-ET=l^^Ai. 

Proof. These are the first order KKT-conditions, which are necessary and sufficient in case 
of convex programs |BV04j . D 

Lemma 2 If Y is a local optimum of ^, then there exists a vector A G M*" such that 

Ti{Y^AiY) = bi, 

{Vxf{YY^)-ZT=l>^^A^)Y = 0. ^^' 

If the { A;^}i=i,...,m O'fs linearly independent, the vector A is unique. 

Proof. These are the first order KKT-conditions for the program ^. D 

Given a local minimizer y of (HD , one readily notices that all but one condition of Lemma 
[J hold for the symmetric positive semidefinite matrix yy^. Comparison of Lemma 1 and 
Lemma 2 therefore provides the following relationship between the nonconvex program ^ 
and the convex program ([!]). 



Theorem 3 A local minimizer Y of the nonconvex program ([^ provides the solution YY^ 
of the convex program HP if and only if the matrix 

771 

SY = Vxf{YY^)-^XiAi (7) 

is positive semidefinite for the Lagrangian multipliers Aj that satisfy (^. 

Proof. Check the conditions of Lemma [1] for the tuple {X, S, a} = {YY'^, S'y , A}. D 

It is important to note that, under Assumption 2, the Lagrangian multipUers in ([6]) have 
the closed-form expression, 

_ TTiY^AiVxfiYY^)Y) 

Hence, the dual matrix Sy in ([7j) can be explicitly evaluated at an optimizer y of (|4]). 

3.2 Second-order optimality conditions 

Let C{Y, A) denote the Lagrangian of the nonconvex program ^, i.e., 

m 

£{Y, A) = f{YY^) - Y, \(Tr(y^A,y) - bi). 

i=l 

In the following, the Lagrangian multipliers A are assumed to satisfy ([6]). A necessary condi- 
tion for Y E M^^P to be optimal is that it is a critical point, i.e., Vy£(y, A) = 0. 

Lemma 4 For a minimizer Y G M"^^ of ^, one has 

Tt{Z^DyVyC{Y, \)[Z]) > (9) 

for any matrix Z G M"^^ that satisfies, 

Tr{Z'^AiY) = 0, i = l,...,m. (10) 

Proof. These are the second order KKT-conditions of the program ^ . D 

Lemma 5 Because of the convexity of f{X), one always has 

]P^{Z'^DyVyC{Y, \)[Z\) = Tt{Z'^SyZ) + a (11) 

with a > and for any matrix Z that satisfies < tJO|) . The term a cancels out once YZ = 0. 
Proof. By noting that Vy£(y, A) = 2SyY , one has 

^Tr(Z'^Z?yVy£(y,A)[Z]) = 

m 

Tr(Z^5yZ) +Tr(Z^Z)y(Vx/(yy^))[^]y) - J]DyAi[Z]Tr(Z^^iy). (12) 



The last term of (fT^ cancels out by virtue of pUj) and the convexity of the function f{X) 
ensures the second term of (fT2]) to be nonnegative, i.e., 



Tr{Z^DY{Vxf{YY^))[Z]Y) = ^Tt{{YZ^ + ZY^)DYiVxf{YY^))[Z]) 

= ^TT{W^Dx{Vxf)[W]) 
>0, 

where X = YY'^ and W = Y Z'^ + ZY'^ £ Sn- □ 



Theorem 6 A local minimizer Y of the program ^ provides the solution X = YY^ of the 
program (QP if it is rank deficient. 

Proof. For the matrix Y G M"^^ to span a r-dimensional subspace, the following factorization 
has to hold, 

Y = YM'^, (13) 

with Y G M"^'' and M a full rank matrix of W^"^. Let M± G MPx(p-'") be an orthogonal basis 
for the orthogonal complement of the column space of M, i.e., M M_l = and Mj_M± = I. 
For any matrix Z G W^^(p~^) ^ the matrix Z = ZMj_ satisfies 

YZ^ = 

such that the conditions (jlOp hold and a cancels out in (jlip . Thus, by virtue of Lemmas H] 
andO 

TriZ^SyZ) > 0, 

for matrices Z = ZMJ_, i.e., the matrix Sy is positive semidefinite and X = YY^ is a solution 
of the problem ([T|). D 



Corollary 7 In the case p = n, any local minimizer Y G ]^"-x" gf ^/jg program ^ provides 
the solution X = YY^ of the program (Op. 

Proof. If Y is rank deficient, the matrix X = YY is optimal for ([1]) by virtue of Theorem [6j 
Otherwise, the matrix Sy is zero because of the second condition in ^ and X is optimal for 
©. □ 



3.3 An algorithm to solve the convex problem 

The proposed algorithm consists in solving a sequence of nonconvex problems @ of increasing 
dimension until the resulting local minimizer Y represents a solution of the convex program 
([1]). Both Theorems [3] and [6] provide conditions to check this fact. When the program (JH) is 
solved in a dimension p smaller than the unknown rank r, none of these conditions can be 



fulfilled. The dimension p is thus incremented after each resolution of ([!]). In order to ensure 
a monotone decrease of the cost function through the iterations, the optimization algorithm 
that solves (j3]) is initialized with a matrix corresponding to Y with an additional zero column 
appended, i.e., Iq = [^|0]- Since this initialization occurs when the local minimizer Y £ W^^p 
of ^ does not represent the solution of ([T]), Yq is a saddle point of the nonconvex problem 
for the dimension p + 1. This can be a critical issue for many optimization algorithms. 
Fortunately, in the present case, a descent direction from Yq can be explicitly evaluated. For 
Lemma O the matrix Z = [0\v], for instance, where is a zero matrix of the size of Y and v 
is the eigenvector of Sy related to the smallest algebraic eigenvalue verifies. 



1 



Tt{Z^DyVyC{Yo, X)[Z]) = v'^Syv < 0, 



since YqZ^ = for the Lagrangian multipliers A given in ([8]). All these elements lead to the 
meta-algorithm displayed in Algorithm [TJ The parameter e fixes a threshold on the eigen- 



values of Sy to decide about the nonnegativity of this matrix, e is chosen to 10 in our 
implementation. 



Algorithm 1: Meta-algorithm for solving the convex program ([T]) 
input : Initial rank pQ, initial iterate Y^^> G ]^"xpo r^j-^j^ parameter e. 
output: The solution X of the convex program ([T|). 

begin 

pi — po 

Yp ^ y(o) 

stop < — 

while stop 7^ 1 do 

Initialize an optimization scheme with Yp to find a local minimum Y* of (JH) by 

exploiting a descent direction Zp if available. 

if p = Po and rank(yi*) < p then 



stop = 1 



else 



Find the smallest eigenvalue An 
matrix Sy ^. 
if Amir, > —e then 



and the related eigenvector Vtnin of the 



I stop 

else 

pi — 



Y 



p 



= 1 

P + l 
- [Yp*\0] 



A descent direction from the saddle point Yp is given by Zp = [0|Vm 



X 



^ p ^ p 



end 



^ A Matlab implementation of Algorithm [T] with the manifold-based optimization method of Section |4] can 
be downloaded from http://www.montefiore.ulg.ac.be/~journee. 



It should be mentioned that, to check the optiniahty for the convex program ([T|) of a local 
minimizer Y* the rank condition of Theorem [6] is computationally cheaper to evaluate than 
the nonnegativity condition of Theorem [3j Nevertheless, the rank condition does not provide 
a descent direction to escape saddle points. It furthermore requires to solve the program @ 
at a dimension that is strictly greater than r, the rank of the solution of ([H). Hence, this 
condition is only used at the initial rank po and holds if po is chosen larger than the unknown 
r. Numerically, the rank of Y* is computed as the number of singular values that are greater 
than a threshold fixed at 10~^. The algorithm proposed in |BM03| exploits exclusively the 
rank condition of Theorem [6l For this reason, each optimization of ^ has to be randomly 
initialized and the algorithm in |BM03j is not a descent algorithm. 

By virtue of Corollary [71 Algorithm [J stops at the latest once p = n. The applications 
proposed in Sections and E] indicate that in practice, however, the algorithm stops at a rank 
p that is much lower than the dimension n. If po < f^ then the algorithm stops once p equals 
the rank r of the solution of ([T]). These applications also illustrate that the magnitude of 
smallest eigenvalue Amin of the matrix Sy can be used to monitor the convergence. The value 
|Amin| indicates whether the current iterate is close to satisfy the KKT conditions ([5]). This 
feature is of great interest once an approximate solution to ([T|) is sufficient. The threshold e 
set on Amin controls then the accuracy of the result. 

A trust-region scheme based on second-order derivative information is proposed in the 
next section for computing a local minimum of (JH). This method is provided with a conver- 
gence theory that ensures the iterates to converge towards a local minimizer. 

Hence, the proposed algorithm presents the following notable features. First, it converges 
toward the solution of the convex program ([T|) by ensuring a monotone decrease of the cost 
function. Then, the magnitude of the smallest eigenvalue of Sy provides a mean to monitor 
the convergence. Finally, the inner problem ^ is solved by second-order methods featuring 
super linear local convergence. 

4 Manifold-based optimization 

We now derive an optimization scheme that solves the nonconvex and nonlinear program, 

min /(Y) 
s.t. Tr{Y^AiY) = k, ^i e §", 6i G M, i = 1, . . . , m, 

where f(Y) = f(YY^) for some / : §" ^ M. 

As previously mentioned. Program (|14|) is invariant by right-multiplication of the variable 
Y by orthogonal matrices. The critical points of (|14p are thus non isolated. The proposed 



algorithm exploits this symmetry by optimizing the cost /(•) on the quotient 

M = M/Op, 

where Op = {Q £ Wp\Q^Q = 1} is the orthogonal group and M = {Y £ M^^ : 
TT{Y^AiY) = bi, i = l,...,m} is the feasible seto Each point of the quotient M is an 
equivalence class 

[Y] = {YQ\Q G Op]. (15) 

It can be proven that the quotient M presents a manifold structure [AMS08J . Program (|14p 
is thus strictly equivalent to the optimization problem, 

min /([y]), 

for the function J : M ^m. : [Y] ^ f{\Y\) = f{Y). 

Several unconstrained optimization methods have been generalized to search spaces that 
are differentiable manifolds. This is, e.g., the case of the trust-region approach. Details on 
this algorithm can be found in |ABG07| IAMS08J . It is important to mention that this al- 
gorithm is provided with a convergence theory whose results are similar to the ones related 
to classical unconstrained optimization. In particular, trust-region methods on manifolds 
converge globally to stationary points of the cost function if the inner iteration produces a 
model decrease that is better than a fixed fraction of the Cauchy decrease; such a property is 
achieved, e.g., by the Steihaug-Toint inner iteration. Since the iteration is moreover a descent 
method, convergence to saddle points or local maximizers is not observed in practice. It is 
possible to obtain guaranteed convergence to a point where the second-order necessary con- 
ditions of optimality hold, by using inner iterations that exploit the model more fully (e.g., 
the inner iteration of More and Sorensen), but these inner iterations tend to be prohibitively 
expensive for large-scale problems. For appropriate choices of the inner iteration stopping 
criterion, trust-region methods converge locally superlinearly towards the nondegenerate lo- 
cal minimizers of the cost function. The parameter 9 in Equation (10) of |ABG07J has been 
set to one, which guarantees a quadratic convergence. 

A few important objects have to be specified to exploit the trust-region algorithm of 
|ABG07J in the present context. First, the tangent space at a point Y of the manifold TW, 

TyM = {Z e M"^P : Ti{Y^A,Z) = 0, i = 1, . . . , m}, 

has to be decomposed in two orthogonal subspaces, the vertical space VyAi and the horizontal 
space TCyM- The vertical space VyM corresponds to the tangent space to the equivalence 
classes, 

VyM = {Yn : n £ Rp^'p, n'^ = -n}. 



R"^'' is the noncompact Stiefel manifold of full-rank n x p matrices. The nondegeneracy condition is 
required to deal with differentiable manifolds. 



The horizontal space TCy-M is the orthogonal complement of VyTW in TyM., i.e., 

HyM = {Z e TyM : Z^Y = Y^Z}, (16) 

for the Euclidean metric {Zi,Z2) = Ti{Zj Z2) for all Zi,Z2 G TyA4. Expression (|16p results 
from the equality Tr(5r2) = that holds for any symmetric matrix S and skew-symmetric 
matrix Vt of compatible dimension. 

Let NyM, the normal space to M at Y , denote the orthogonal complement of TyM in 
M"^P, i.e., NyM = {YT=i ctiAiY, a e M""}. Hence, the Euclidean space M"^^ can be divided 
into three mutually orthogonal subspaces, 



nnxp 



HyM © VyM © NyM. 



The trust-region algorithm proposed in |ABG07J requires a projection PYi') from M"^^ to 
T-CyM along VyM © NyM. The following theorem provides a closed- form expression. 

Theorem 8 Let Y be a point on M. For a matrix Z £ M^^^, the projection Py{') '■ K"^^ -^ 
TCyM is given by 

m 

PY{Z)=Z-Yn-Y,aiAY, 

i=l 

where il is the skew symmetric matrix that solves the Sylvester equation 

nY'^Y + y'^yq = y'^z - Z'^Y, 

and with the coefficients 

_ Tr{Z^A,Y) 
on 



Ti{YTA^Y) ■ 
Proof. Any vector Z G M"^^ presents a unique decomposition 

Z = ZvyM + ZfiyM + ^NyM' 

where each element Z;^ belongs to the Euclidean space X. The orthogonal projection 'Py(-) 
extracts the component that lies in the horizontal space, i.e.. 



PYiZ) = Z-Yn-J2(^iAiY, 



i=l 

with 0, a skew symmetric matrix. The parameters Q and a are determined from the linear 
equations 

y'^Py(Z) = PY{ZfY, 
TT{Y^AiPY{Z)) = 0, i = l...m, 

which are satisfied by any element of the horizontal space. D 



The projection PYi') provides simple formulas to compute derivatives of the function / 
(defined on the quotient manifold) from derivatives of the function / (defined in the Euclidean 
space). The gradient corresponds to the projection on the horizontal space of the gradient of 
the function f(Y), i.e., 

grad/(y)=Py(Vy/(y)). 

The Hessian applied on a direction Z G TCyM is given by 

Hess/(y)[Z] = PyiDyiPYiVYfiYmZ]), 
where the directional derivative -Dy (•)[•] is performed in the Euclidean space K^^^. 

Finally, a last ingredient needed by the trust-regions algorithm in |ABG07j is a retraction 
7^y(-) that maps a search direction Z (an element of the horizontal space at Y) to a matrix 
representing a new point on the manifold M. Such a mapping is for example given by the 
projection of the matrix Y = Y + Z along the Euclidean space NyM, i.e., 

m 

ny{Z) = [Y + Y,aiAiY], (17) 

where [•] denotes the equivalence class p5p and the coefficients a^ are chosen such that 

Tr(y^A,F) = bi, 

with Y = Y + Yl^i o^i-^iY ■ Under Assumption 2, the coefficients ctj are easily computed as 
the solution of the quadratic polynomial, 

af Tr(y^^fy) + 2ail^{Y^ A^iY) + Ti{Y^ AiY) = bi. 

In case of the elliptope £, Equation (fT7|) becomes, 

7^y(z) = [Diag((y + z){Y + zfr'HY + Z)], 

where Diag(X) denotes the diagonal matrix whose diagonal elements are those of X and the 

brackets refer to the equivalence class (fT5|) . For the spectahedron S, the retraction (fT7|) is 

given by 

. ^ \ Y+Z 

ny{Z) 



yTV((y + z)^(y + z))_ 

The complexity of the manifold-based trust-region algorithm in the context of program 
(fT^ is dominated by the computational cost required to evaluate the objective /(y), the gra- 
dient '^yfi^) and the directional derivative Dy{SJyf(Y))[Z]. Hence, the costly operations 
are performed in the Euclidean space R"'^^, whereas all manifold-related operations, such as 
evaluating a metric, a projection and a retraction, are of linear complexity with the dimension 
n. 



5 Optimization on the elliptope: the max-cut SDP relaxation 

A first application of the proposed optimization method concerns the maximal cut of a graph. 

The maximal cut of an undirected and weighted graph corresponds to the partition of the 
vertices in two sets such that the sum of the weights associated to the edges crossing between 
these two sets is the largest. Computing the maximal cut of a graph is a NP-complete 
problem. Several relaxations to that problem have been proposed. The most studied one 
is the 0.878-approximation algorithm [GW95] that solves the following semidefinite program 
(SDP), 

min Tr(AX) 

s.t. diag(X) = 1, (18) 

XhO, 

where A = —jL with L the Laplacian matrix of the graph and 1 is a vector of all ones. This 
relaxation is tight in case of a rank-one solution. 

As previously mentioned, the elliptope, 

£ = {X e S"|A: h 0,diag(A:) = 1}, 

satisfies Assumption 2. Hence, Program (|18p is a good candidate for the proposed framework. 
Using the factorization X = YY^ , the optimization problem is defined on the quotient 
manifold ^As = Ms /Op, where 

Me = {Y (^ M^^P : diag(yy^) = 1}. 

The complexity of Algorithm [1] in the present context is of order 0{n?p). This complexity 
is dominated by both the manifold-based optimization and the eigenvalue decomposition of 
the dual variable Sy-, that are 0{n?p). The computational cost related to the manifold-based 
optimization is however reduced in case of matrices A that are sparse. 

Table [T] presents computational results obtained with Algorithm [T] for computing the max- 
imal cut of a set of graphs. The parameter n denotes the number of vertices of these graphs 
and corresponds thus to the size of the variable X in (jlSp . More details on these graphs can 
be found in |BM03j and references therein. The low-rank method is compared with the SD- 
PLR algorithm proposed in |BM03j . that also exploits the low rank factorization X = YY^ 
in the case of semidefinite programs (SDP). The rank of the optimizer Y* indicates that 
the factorization X = YY reduces significantly the size of the search space. Concerning 
the computational time, it is important to mention that Algorithm [T] has been implemented 
in Matlab, whereas a C implementation of the SDPLR algorithm has been provided by the 
authors of [BM03J . Although this renders a rigorous comparison of the computational load 
difficult, Table [U suggests that both methods perform similarly. 









Objective values 


CPU time (sec) 


Graph 


n 


Rank(y*) 


Aigo. m 


SDPLR 


Aigo. m 


SDPLR 


toruspm3-8-50 


512 


8 


-527.81 


-527.81 


17 


3 


toruspm3-15-50 


3375 


15 


-3474.79 


-3474.76 


1051 


181 


torusg3-8 


3375 


7 


-3187.61 


-3188.09 


375 


228 


Gl 


800 


13 


-12083.2 


-12083.1 


57 


35 


Gil 


800 


5 


-629.16 


-629.15 


53 


15 


G14 


800 


13 


-3191.57 


-3191.53 


82 


13 


G22 


2000 


18 


-14136.0 


-14135.9 


358 


101 


G32 


2000 


5 


-1567.58 


-1567.57 


158 


69 


G35 


2000 


14 


-8014.57 


-8014.33 


525 


68 


G36 


2000 


13 


-8005.60 


-8005.80 


459 


115 


G58 


5000 


8 


-20111.3 


-20135.4 


1881 


1119 



Table 1: Computational results of Algorithm [T] (implemented in Matlab) and the SDPLR 
algorithm (implemented in C) on various graphs. 



Figure [T] depicts the monotone convergence of the Algorithm [T] for the graph toruspm3- 
15-50. The number of iterations is displayed on the bottom abscissa, whereas the top abscissa 
stands for the rank p. Figure [2] indicates that the smallest eigenvalue Amin of the dual matrix 
Sy monotonically increases to zero. One notices that the magnitude of Amin gives some insight 
on the current accuracy. 
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Figure 1: Monotone decrease of the cost function f{Y) = Tr{Y^ AY) through the iterations 
(bottom abscissa) and with the rank p (top abscissa) in the case of the graph toruspm3-15-50. 
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Figure 2: Evolution of the smallest eigenvalue of the matrix Sy (case of the graph toruspmS- 
15-50). 

6 Optimization on the spectahedron: the sparse PCA problem 

This section presents three nonlinear programs that concern the sparse principal component 
analysis problem and that can be efficiently solved with the proposed low-rank optimization 
approach. 

Principal component analysis (PCA) is a tool that reduces multidimensional data to lower 
dimension. Given a data matrix A E M™^", the first principal component consists in the best 
rank-one approximation of the matrix A in the least square sense. This decomposition is 
performed via estimation of the dominant eigenvector of the empirical covariance matrix 
S = A^A. In many applications, it is of great interest to get sparse principal components, 
i.e., components that yield a good low-rank approximation of A while involving a limited 
number of nonzero elements. In case of gene expression data where the matrix A represents 
the expression of n genes through m experiments, getting factors that involve just a few 
genes, but still explain a great part of the variability in the data, appears to be a modelling 
assumption closer to the biology than the regular PCA [TJA"'"07J . This tradeoff between 
variance and sparsity is the central motivation of sparse PCA methods. More details on the 
sparse PCA approach can be found in |ZHT061 ldEJL07| and references therein. 



Sparse PCA is the problem of finding the unit-norm vector x £ M" that maximizes the 
Rayleigh quotient of the matrix S = A^A but contains a fixed number of zeros, i.e.. 



max x-^Sx 
s.t. x'^x = 1, 



(19) 



Card(x) < k, 
where k is an integer with 1 < k < n and Card(x) is the cardinality of x, i.e., the number of 



non zero components. Finding the optimal sparsity pattern of the vector x is of combinatorial 
complexity. Several algorithms have been proposed in the literature that find an approximate 
solution to (J19p . We refer to |dEJL07] for references on these methods. Let us finally mention 
that the data matrix A does not necessarily have to present a sparse pattern. In the context 
of compressed sensing, for example, one needs to compute the sparse principal component of 
a matrix A that is full and sampled from a gaussian distribution [dBEOTj . 

Recently, two convex relaxations have been derived that require to minimize some non- 
linear convex functions on the spectahedron S = {X € S™|X ^ 0, Tr(X) = 1}. Both of these 
relaxations consider a variation of (J19p . in which the cardinality appears as a penalty instead 
of a constraint, i.e., 

max x'^TiX — pCard(x) 



s.t. x'^x = 1, 



(20) 



with the parameter p >0. 



6.1 A first convex relaxation to the sparse PCA problem 

In [dE JL07 J . Problem (j20p is relaxed to a convex program in two steps. First, a convex 
feasible set is obtained by lifting the unit norm vector variable x into a matrix variable X 
that belongs to the spectahedron, i.e., 

max Tr(SX) - pCard(X) 

s.t. TV(X) = 1, (21) 



The relaxation (j2ip is tight for rank-one matrices. In such cases, the vector variable x in (j20p 
is related to the matrix variable X according to X = xx . Then, for (j2ip to be convex, the 
cardinality penalty is replaced by a convex li penalty, i.e., 

™ Tr(SX)-p^._^. I^ijl 

s.t. Tr(X) = 1, (22) 

X^O. 



Finally, a smooth approximation to (j22p is obtained by replacing the absolute value by the 
differentiable function hn{x) = \/x'^ + k? with the parameter ac that is very small. A too small 
n might, however, lead to ill-conditioned Hessians and thus to numerical problems. 



The convex program, 



s.t. Ti{X) = 1, (23) 

X^O, 



fits within the framework ([T]). The variable X is thus factorized in the product YY^ and the 
optimization is performed on the quotient manifold Aig = Mg/Op where 

Ms = {Y e m:^^p : T¥(y^y) = 1}. 

The computational complexity of Algorithm [T] in the context of program (|23|) is of order 
0{'n?p). It should be mentioned that the DSPCA algorithm derived in [dEJLOT] and that has 
been tuned to solve program ()22p features a complexity of order O(n^). 



Figure [3] illustrates the monotone convergence of Algorithm [T] on a random gaussian ma- 
trix A of size 50 X 50. The sparsity weight factor p has been chosen to 5 and the smoothing 
parameter k equals 10~^. The maximum of the nonsmooth cost function in (j22p has been 
computed with the DSPCA algorithm (dEJLOT] . One first notices that the smooth approxi- 
mation in (|23|) slightly underestimates the nonsmooth cost function (|22|) . The maximizers of 
both (|22p and (j23p are, however, almost identical. Then, we should mention that all numerical 
experiments performed with the DSPCA algorithm for solving ()22p resulted in a rank one 
matrix. So, the solution of (j23p is expected to be close to rank one. This explains why the 
improvement in terms of objective value is very small for ranks larger than one. A heuristic to 
speed up the computations would thus consist in computing an approximate rank one solution 
of (j23p . i.e.. Algorithm [1] is stopped after the iteration p = 1. Finally, on the right hand plot, 
Figure [3] highlights the smallest eigenvalue Amin of the matrix Sy as a way to monitor the 
convergence. 
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Figure 3: Left: monotone increase of f{Y) = Tt{Y'^IlY) — p'^Zi jhi^{0^^'^)ij) through the 
iterations (bottom abscissa) and with the rank p (top abscissa). The dashed horizontal line 
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smallest eigenvalue of Sy- 



Figure H] provides some insight on the computational time required by a Matlab imple- 
mentation of Algorithm [1] that solves (|23p . Square gaussian matrices A have been considered, 



i.e., m = n. On the left hand plot, Algorithm [T] is compared with the above mentioned 
heuristic and the DSPCA algorithm. The right hand plot highlights the quadratic complexity 
of Algorithm [1] with the problem size n. 
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Figure 4: Right: Computational time for solving (j23p versus the problem size in the case 
p = n. Left: Square root of the computational time versus n. 



6.2 A second convex relaxation to the sparse PCA problem 

Problem (j20p is shown in |dBE07j to be strictly equivalent to 

max YA-ii{aJz)'^ - p)+, 
s.t. z'^z = 1, 



(24) 



where Oj is the i**^ column of A and the function x+ corresponds to max(0, x). The auxiliary 
variable z enables to reconstruct the vector x: the component Xj is active if {of z)"^ — /? > 0. 
As for the relaxation previously derived in Section 16. H the vector z is lifted into a matrix Z 
of the spectahedron. 



m|x Er=i Tr(af Za, - /5)_ 
s. t. Tr(Z) = 1, 



(25) 



zz 



Program ([25 



This program is equivalent to (j24p in case of rank one matrices Z 

maximizes a convex function and is thus nonconvex. The authors of |dBE07j have shown 

that, in case of rank one matrices Z, the convex cost function in (I25p equals the concave 

function 

n 

/(Z)=J^Tr(Z^(afa,-p/)Z^)+, (26) 

i=l 



where the function Tr(X)_|_ stands for the sum of the positive eigenvalues of X. This gives 
the following nonsniooth convex relaxation of (j20p . 



max Er=i Tr(Z2 (a^ a, - pI)Z2)^ 

s. t. Tr(Z) = 1, (27) 

that is tight in case of rank-one solutions. This program is solved via the factorization 
Z = YY^ and optimization on the quotient manifold A^^. In the case Z = YY , function 
(p6|) equals 

n 

f{Y) = Y,T^T{Y^{afa,-pI)Y)+, 

i=l 

which is a spectral function [dBE07] . The evaluation of the gradient and Hessian of f{Y) 
are based on explicit formulae derived in the papers [LewQGj ILSOlj to compute the first and 
second derivatives of a spectral function. Since we are not aware of any smoothing method 
that would preserve the convexity of (j27p . Algorithm [1] has been directly applied in this non- 
smooth context. In practice, no trouble has been observed since all numerical simulations 
converge successfully to the solution of (j27p . The computational complexity of Algorithm [1] 
for solving (j27p is of order 0{nm'^p). The convex relaxation (j27p of the sparse PC A problem 
(pOj) appears thus well suited to treat large scale data with m <^ n, such as gene expression 
data are. 

Figure \5\ displays the convergence of Algorithm [T] for solving (|27p with a random gaussian 
matrix A of size m = 100 and n = 500. The sparsity parameter p is chosen at 5 percent of 
the upper bound p = max afai, that is derived in |dBE07j . The smallest eigenvalue Amin of 

i 

the matrix Sy presents a monotone decrease once it gets sufficiently close to zero. 

Figure [6] plots the CPU time required by a Matlab implementation of Algorithm [1] versus 
the dimension n of the matrix A. The dimension p has been fixed at 50 and A is chosen 
according to a gaussian distribution. Figure [6] illustrates the linear complexity in n of the 
proposed sparse PCA method. 

6.3 Projection on rank one matrices 



Both convex relaxations (j23p and (j27p are derived from the reformulation of a problem de- 
fined on unit norm vectors x into a problem with matrices X = xx , which is an equivalent 
formulation if X belongs to the spectahedron and has rank one. Within the derivation of 
both convex relaxations, the rank one condition has been dropped. The solutions of (|23p and 
()27p are therefore expected to present a rank larger than one. 



As previously mentioned, all numerical experiments performed with the DSPCA algorithm 
[dE JL07] ■ which solves the nonsmooth convex program (j22p . led to a rank one solution. Thus, 
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Figure 5: Left: monotone increase of the cost function through the iterations (bottom ab- 
scissa) and with the rank p (top abscissa). Right: evolution of the smallest eigenvalue of 
Sy. 



the solution of the smooth convex relaxation (j22p is expected to tend to a rank one matrix 
once the smoothing parameter k gets sufficiently close to zero. Figure [7] illustrates this fact. It 
should be mentioned that a matrix X of the spectahedron has nonnegative eigenvalues whose 
sum is one. Hence, X is rank one if and only if its largest eigenvalue equals one. In order to 
deal with potential numerical problems in case of very small k, we sequentially solve problems 
of the type of (j23p with a decreasing value of k. The solution of each problem initializes a 
new program ([23|) with a reduced k. 



In contrast to (j22p , the convex relaxation (j27p usually provides solutions with a rank that 
is larger than one. The solution matrix X has to be projected onto the rank one matrices of 
the spectahedron in order to recover a vector variable x. A convenient heuristic is to compute 
the dominant eigenvector of the matrix X. A vector x that achieves a higher objective value 
in (j20p might, however, be obtained with the following homotopy method. We consider the 
program 



max iJ.fcvx{Z) + (1 - fJ-)fccv{Z) 
s. t. Tr(Z) = 1, 



(28) 



with the concave function. 



and the convex function. 



fcUZ) = Y,MzHafai-pI)Z-2: 



i=l 



fcvx{Z) = ^Tr(afZai - /?)+, 



j=i 




Figure 6: Computational time for solving (j27p versus the problem size n in the case p 



50. 



and for the parameter < /i < 1. As previously mentioned, in case of rank one matrices 
Z = zz^ , the functions fccv{Z) and fcvx{Z) are identical and equal to the cost function (j24p . 
For n = 0, program (j28p is the convex relaxation (j27p and the solution has typically a rank 
larger than one. If /i = 1, solutions of ()28p are extreme points of the spectahedron, i.e., 
rank one matrices. Hence, by solving a sequence of problems ([28}) with the parameter fi that 
increases from zero to one, the solution of (j27p is projected onto the rank one matrices of 
the spectahedron. Program ()28p is no longer convex once fi > 0. The optimization method 
proposed in this paper then converges towards a local maximizer of ()28p . 



Figure [8] presents computational results obtained on a random gaussian matrix A G 
)x50 This projection method is compared with the usual approach that projects the 
symmetric positive semidefinite matrix Z onto its dominant eigenvector, i.e., Z = zz'^ where 
z is the unit-norm dominant eigenvector of Z. Let fEVoiZ) denotes the function|f| 



fEVniZ) = fccv{Z) = fcvx{Z). 

Figure [8] uses the maximum eigenvalue of a matrix Z of the spectahedron to monitor its rank. 
As previously mentioned, any rank one matrix Z of the spectahedron satisfies Aniax(^) = 1- 
The continuous plots of Figure [8] display the evolution of the functions fccv{Z) and fEVD{Z) 
during the resolution of the convex program (pTj) . i.e., /i = in (p8]) . Point A represents 
the solution obtained with Algorithm [1] by solving (j27p at the rank p = 1, whereas B and 
B' stands for the exact solution of (|27p . which is of rank larger than one. The dashed plots 
illustrate the effect of the parameter //, that is linearly increased by steps of 0.05 between the 
points B and C . For a sufficiently large ^, program (|28p presents a rank one solution, which 
is displayed by the point C. One clearly notices that the objective function of the original 
problem ([2^ . which equals fEvniZ), is larger at C than at B'. Hence, the projection method 



EVD stands for eigenvalue decomposition. 
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Figure 7: Left: evolution of the maximum the cost in (j23p with the smoothing parameter k. 
The dashed horizontal line represents the maximum of the nonsmooth cost function in (j22p . 
Right: evolution of the largest eigenvalue of the solution of (|23|) . 

based on (j28p outperforms the projection based on the eigenvalue decomposition of Z in terms 
of achieved objective value. 



7 Conclusion 

We have proposed an algorithm for solving a nonlinear convex program that is defined in 
terms of a symmetric positive semidefinite matrix and that is assumed to present a low-rank 
solution. The proposed algorithm solves a sequence of nonconvex programs of much lower 
dimension than the original convex one. It presents a monotone convergence towards the 
sought solution, uses superlinear second order optimization methods and provides a tool to 
monitor the convergence, which enables to evaluate the quality of approximate solutions for 
the original convex problem. The efficiency of the approach has been illustrated on several 
applications: the maximal cut of a graph and various problems in the context of sparse 
principal component analysis. The proposed algorithm can also deal with problems featuring 
a nonconvex cost function. It then converges toward a local optimizer of the problem. 
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